RAG Text Chunker — heading & sentence aware, Japanese ready
Pricing: Pay per usage
Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.
Rating: 0.0 (0)
Developer: Shinobu Otani
Maintained by Community
Actor stats
- Bookmarked: 0
- Total users: 2
- Monthly active users: 1
- Last modified: 3 days ago
RAG Text Chunker
Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready — deterministic, no LLM cost.
- Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
- Packs whole sentences up to `max_chars`; oversized sentences are hard-split as a last resort
- Optional overlap between consecutive chunks for retrieval continuity
- Japanese-aware boundaries: 。!? with closing-quote handling alongside Latin `.!?` (decimals like `3.14` stay intact)
- Heading breadcrumbs: every chunk carries `heading_path` for citation
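
The sentence handling described in the list above can be pictured with a minimal Python sketch. This is not the Actor's actual implementation: `split_sentences`, `pack_sentences`, and the regular expression are hypothetical names and rules chosen for illustration. The sketch also assumes that overlap carries whole trailing sentences totalling at most `overlap` characters, and that sentences are joined with a single space.

```python
import re

# Hypothetical boundary rule (an assumption, not the Actor's real pattern):
# Japanese or Latin terminators, followed by any closing quotes or brackets.
# A Latin "." only counts before whitespace or end of text, so "3.14" stays intact.
_CLOSERS = r"[」』))\"']*"
_SENTENCE_END = re.compile(
    r"(?:[。!?!?]+|\.(?=" + _CLOSERS + r"(?:\s|$)))" + _CLOSERS
)


def split_sentences(text):
    """Split text into sentences, keeping terminators and closing quotes attached."""
    sentences, start = [], 0
    for match in _SENTENCE_END.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def _joined_len(parts):
    """Length of the parts once joined with single spaces."""
    return sum(len(p) for p in parts) + max(len(parts) - 1, 0)


def pack_sentences(sentences, max_chars, overlap=0):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Last resort: hard-split any sentence that is longer than max_chars on its own.
    units = []
    for sentence in sentences:
        units.extend(
            sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)
        )

    chunks, current = [], []
    for unit in units:
        if current and _joined_len(current + [unit]) > max_chars:
            chunks.append(" ".join(current))
            # Assumed overlap policy: carry whole trailing sentences
            # whose combined length is at most `overlap` characters.
            carried, size = [], 0
            for prev in reversed(current):
                if size + len(prev) > overlap:
                    break
                carried.insert(0, prev)
                size += len(prev)
            current = carried
            # Drop carried sentences that would push the next chunk over the limit.
            while current and _joined_len(current + [unit]) > max_chars:
                current.pop(0)
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a small `max_chars`, for example `pack_sentences(["aaaa.", "bbbb.", "cccc."], max_chars=11, overlap=5)`, the middle sentence appears at the end of the first chunk and again at the start of the second, which is the continuity the overlap option is meant to provide.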
Input
{"documents": ["# 概要\n\n検証は三段階で行う。まず再現する。"], "max_chars": 1500, "overlap": 200}
Output (one dataset item per chunk)
{"id": 0, "document_index": 0, "heading_path": ["概要"], "text": "検証は三段階で行う。 まず再現する。", "char_count": 19}
Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.
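
For readers who want to see how a `heading_path` breadcrumb can be derived, here is a rough Python sketch of heading-aware sectioning. It is an illustration under stated assumptions rather than the Actor's code: `split_by_headings` is a hypothetical helper, only ATX-style `#` headings are recognised, and fence tracking is simplified to toggling on any line that starts with three backticks or tildes.

```python
import re

# Assumption: only ATX headings ("#" through "######") count as section boundaries.
_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# Simplified fence detection: any line starting with ``` or ~~~ toggles fence state.
_FENCE = re.compile(r"^\s*(```|~~~)")


def split_by_headings(markdown):
    """Return a list of (heading_path, body_text) sections."""
    sections, path, body = [], [], []
    in_fence = False

    def flush():
        # Emit the accumulated body under the current breadcrumb, if non-empty.
        text = "\n".join(body).strip()
        if text:
            sections.append((list(path), text))
        body.clear()

    for line in markdown.splitlines():
        if _FENCE.match(line):
            in_fence = not in_fence
            body.append(line)
            continue
        # Inside a fenced block, "# ..." is code or a comment, not a heading.
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop breadcrumb entries at this level or deeper, then add the new heading.
            del path[level - 1:]
            path.append(match.group(2))
            continue
        body.append(line)
    flush()
    return sections
```

Each returned section would then be split into sentences and packed separately, so that no chunk mixes text from two headings and every chunk can carry its section's breadcrumb.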